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ABSTRACT 

Summary: Computational genomics seeks to draw biological infer- 
ences from genomic datasets, often by integrating and contextualizing 
next-generation sequencing data. CGAT provides an extensive suite 
of tools designed to assist in the analysis of genome scale data from a 
range of standard file formats. The toolkit enables filtering, compari- 
son, conversion, summarization and annotation of genomic intervals, 
gene sets and sequences. The tools can both be run from the Unix 
command line and installed into visual workflow builders, such as 
Galaxy. 

Availability: The toolkit is freely available from http://github.com/ 
CGATOxford/cgat 
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1 INTRODUCTION 

A central task in computational genomics is to extract biologic- 
ally meaningful summaries and annotations from short read se- 
quences to facilitate both visualization and statistical analysis. 
Commonly, this process starts by mapping next-generation 
sequencing (NGS) reads and quantitating their distribution in 
genomic features such as transcripts with expression level and 
transcription factor binding sites with peak scores. This initial 
contextualization phase is well supported by specialized tools 
such as Tophat/Cufflinks (Trapnell et aL, 2012) or MACS 
(Feng et aL, 2012). In a second phase, datasets are typically 
integrated to allow interpretation, asking, for example, how 
many transcription factor binding sites are associated with 
each of exonic, intronic, flanking and intergenic genomic anno- 
tations. This phase necessarily relies on computational tools that 
can describe, integrate and summarize a variety of feature files 
produced from the initial phase and external annotation sources. 
Here we introduce a collection of tools that assist genomic sci- 
entists in successfully performing this crucial data integration 
and interpretation phase, bridging the gap from raw data to 
biologically interpretable results. We have made extensive use 
of these tools in a number of NGS projects (Long et aL, 2013, 
Rajan et aL, 2013, Ramagopalan et aL, 2010). 
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2 OVERVIEW 

The computational genomic analysis toolkit comprises >50 tools, 
each with documentation and examples. Tools are tagged to fa- 
cilitate discovery. Tags associate tools with broad themes (gen- 
omic intervals, gene sets, sequences), standard genomic file 
formats (BED, GTF, BAM, FASTA/Q) and the type of compu- 
tation performed by the tool, such as statistical summary, format 
conversion, annotation, comparison or filtering. 

As an illustrative example, a gene set can be annotated with the 
tool gtf2table. In fact, gtf2table provides >25 different methods to 
annotate transcript models. Annotation is dependent on auxiliary 
data: given a genome sequence, transcripts can be annotated by 
composition (e.g. %GC); given a reference gene set, transcripts 
can be marked as fragments or extensions, enabling the user to 
ascertain the completeness of transcript models built for RNAseq 
data. Given a BAM file with NGS read data, gtf2table can com- 
pute coverage in sense/antisense direction over transcript models; 
another example, bam2geneproflle computes and plots metagene- 
profiles from mapped NGS read data in BAM format (Fig. la). 
Different metagene models (with/without UTRs/introns, etc.) and 
various normalization options are available. Finally, the tool 
bam2peakshape computes read densities in specified genomic 
intervals to generate matrix data suitable for visualization in heat- 
maps (Fig. lb). The toolkit also contains standard sequence ana- 
lysis utilities such as fasta2table, which annotates sequences with 
CpG frequencies, codon frequencies and amino acid composition. 
To assist the interpretation of NGS data, the toolkit implements 
various classification schemes for transcript data or interval data. 
RNA-seq-derived transcripts can be marked as instances, frag- 
ments, extensions or alternative versions of transcripts in a refer- 
ence gene set. Chromatin immunoprecipitation- sequencing 
(ChlP-Seq) intervals can be marked as intronic, intergenic or 
within the UTR, upstream or downstream regions of transcript 
models. Finally, the toolkit provides tools to summarize genomic 
datasets, reporting the number of intervals or transcripts per 
chromosome, size distributions of features and more. 

3 USAGE 

We introduce the usage of the computational genomics analysis 
toolkit with a brief example. The fully worked example can be 
found online. Given a set of transcription factor binding intervals 
from a ChlP-seq experiment in BED format {nfkb.bed), we wish 
to determine how many binding intervals lie within exons, introns 
or intergenic sequence using a reference gene set from 
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Fig. 1. Visualization of the output of (a) bam2geneprofile and (b) 
bam2peakshape 

GENCODE (Harrow et ai, 2012), in GTF format (hgW.gtj). We 
then want to plot the density of binding relative to transcript 
models and examine the chromatin signatures within the intervals: 

cgat gtf2gtf-sort = gene<hgl9.gtf 
| cgat gtf2gtf-merge-exons-with-utr (1) 
| cgat gtf2gtf-filter = longest-gene (2) 

| cgat gtCgff-flank^SOOO-method^genome^^^woto^bra.g/T" 
cgat bed2table-filename-gff = annotations. gff 

-counter=classifier-chipseq<nfkb.bed>a««<9 tated _peaks. tsv^ 

The above sequence of Unix commands in turn (1) merges all 
exons of alternative transcripts per gene, (2) retains the longest 
gene in the case of overlapping genes and (3) annotates exonic, 
intronic, intergenic and flanking regions (size = 5 kb) within and 
between genes. Choosing different options can provide different 
sets of answers. Instead of merging all exons per gene, the longest 
transcript might be selected by replacing (2) with gtf2gtf - 
filter = longest-transcript. Note that the creation of annota- 
tions. gff goes beyond simple interval intersection, as gene struc- 
tures are normalized from multiple possible alternative 
transcripts to a single transcript that is chosen by the user de- 
pending on what is most relevant for the analysis. 

The generated annotations are then used to classify the tran- 
scription factor binding sites using bed2 table (4). The profile of 
ChlP-seq binding over genes can be calculated and plotted using 
bam2geneprofile (Fig. la). Chromatin state at ChlP-seq peaks 
can be investigated by integrating H3K4mel and H3K4me3 
data for a relevant tissue (ENCODE Project Consortium et ai, 
2012) using bam2peakshape and plotted in R (R Core Team, 
2012) (Fig. lb). Statistical significance can be assessed using 
tools such as GAT (Heger et ai, 2013). More usage examples, 
including testing for functional enrichment, assessment of CpG 
content in long non-coding RNA promoters and clustering meta- 
genomic contigs on tetranucleotide frequency, can be found 
online. 



4 IMPLEMENTATION 

We aim to write legible and maintainable code that can serve as 
an entry point into computational methods for biologists. The 
toolkit is implemented in the Python language (van Rossum, 
1995). Some performance-critical sections have been imple- 
mented in Cython (Behnel et al., 2011). The toolkit can be 



installed from common Python package repositories. 
Dependencies will be installed automatically, although some 
tools require external software to be installed. All tools are 
freely available under the BSD 3 -clause licence. The toolkit is 
under constant development, and community involvement in the 
project is welcome. Regression tests ensure that core functional- 
ity is maintained as scripts are extended. All tools are built using 
a common coding style and follow a naming scheme centred on 
common genomic file formats. The tools have a consistent com- 
mand line interface enabling them to be combined into work 
flows using Unix pipes and integrated into automated pipelines 
allowing automated and parallel execution. They use a consistent 
logging mechanism to facilitate issue tracking. Furthermore, the 
use of common genomic formats means that tools can be easily 
combined with other popular genomic software such as 
BEDtools (Quinlan and Hall, 2010), University of California, 
Santa Cruz tools (Kuhn et al, 2013) or biopieces, http://www. 
biopieces.org. An RDF (Resource Description Framework, 
http://www.w3.org/RDF) description of each tool can be gener- 
ated for use with tools such as CLI-MATE (Tatum et al, 2011) 
to generate XML files for a variety of workflow frameworks, 
such as Galaxy (Goecks et al, 2010). 
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